1 Workshop introduction

In this learning module, we will be working with a 16S rRNA dataset to explore microbial diversity within a sample. To accomplish this, we’ll use RStudio in combination with several specialized packages tailored for microbiome data analysis. Our primary objectives are to identify the bacterial species present in the sample and to quantify the abundance of each species. Each sequence read in the dataset represents an observation, which we will process and analyze to gain insights into the microbial community structure.

Analyzing 16S rRNA data involves several critical steps to ensure the results are accurate and meaningful. One of the most important steps is to differentiate biologically relevant sequences—those that correspond to actual bacterial species—from technical noise, such as sequencing errors. Sequencing technologies can introduce small errors or artefacts, which, if not addressed, can lead to incorrect interpretations of microbial diversity and abundance. Thus, careful data processing, including quality filtering and error correction, is essential to minimize noise and enhance data reliability.

Once we’ve filtered out noise, we’ll proceed with taxonomic assignment, where we match sequence data to known bacterial species or taxa. This step often involves using reference databases and classification algorithms that can accurately categorize bacterial sequences. The final step will be to quantify the presence of each bacterial species, providing a snapshot of the microbial community composition in the sample.

Throughout this module, we’ll discuss and apply each of these analytical steps in detail, helping to build a solid foundation in microbiome data analysis and the principles of bioinformatics.

2 R working environment

This workshop can be perfomed on any machine running Rstudio. However might require custom adaptations and will cost a lot of time. Therefore we have setup a dedicated server which runs Rstudio with all requirements preinstalled.

You can gain acces to the server trough your browser:

http://3.78.246.83/ name: name_lastname pass: SHIBIS

You will need to download the associated files and rmarkdown file which was used to generate this tutorial by cloning the associated github repository. You can easily do this from within the Rstudio interface.

Projects => Version control => Git => https://github.com/AMCMC/SH_IBIS_Day1

Open the rmd file and sequentially run the individual code chuncks.

3 Introduction 16S rRNA amplicon dataset

Amplicon sequencing is the most commonly used approach to determine the microbiome composition of a particular environment. Specific primers targeting universal genes are used to amplify phylogenetic divergent marker sequences. The obtained amplified sequences (amplicons) can than be sequenced using various platforms. The main task when analyzing these datasets is to determine which are true biological sequences and which are technical noise.

Fig 1. Single step 16S rRNA gene amplicon Illumina library preparation

Fig 1. Single step 16S rRNA gene amplicon Illumina library preparation

In this practical session, we will analyze 16S rRNA amplicon sequences obtained from a mock microbial community. These amplicons were generated using universal primers that target the V3V4 regions of the 16S rRNA gene and were sequenced on an Illumina MiSeq platform (2x251) using V3 chemistry. The mock sample was artificially constructed from DNA of 55 unique bacterial strains (Table 1 - MC3). Because we know the true composition of this sample, it serves as a benchmark for evaluating the accuracy of our wet-lab protocols and bioinformatics pipelines.

During this workshop, we will apply the DADA2 software package to infer the taxonomic composition of the data and compare it to the known composition. DADA2 is specifically designed to identify the true amplicon sequence variants (ASVs) present in a sample, providing high-resolution insights into microbial diversity. Part of this workshop is based on the tutorial available on the DADA2 website.

Other packages we will use include:

4 Configure R environment and get data

All dependancies are preinstalled in the server, however if you wish to run it on your own machine, the following code will install everything required locally.

.cran_packages <- c("ggplot2", "reshape2", "stringr","phangorn","DT","ape","ggrepel","knitr","Hmisc")
.bioc_packages <- c("dada2", "phyloseq","ShortRead","DECIPHER","ggtree")
sapply(c(.cran_packages, .bioc_packages), require, character.only = TRUE, quietly = T)
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:ape':
## 
##     zoom
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:phyloseq':
## 
##     distance
## The following object is masked from 'package:grDevices':
## 
##     windows
## 
## Attaching package: 'Biostrings'
## The following objects are masked from 'package:Hmisc':
## 
##     mask, translate
## The following object is masked from 'package:ape':
## 
##     complement
## The following object is masked from 'package:base':
## 
##     strsplit
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
## The following object is masked from 'package:phyloseq':
## 
##     sampleNames
## The following object is masked from 'package:Hmisc':
## 
##     contents
## 
## Attaching package: 'ShortRead'
## The following object is masked from 'package:Hmisc':
## 
##     zoom
## The following object is masked from 'package:ape':
## 
##     zoom
## ggtree v3.10.0 For help: https://yulab-smu.top/treedata-book/
## 
## If you use the ggtree package suite in published research, please cite
## the appropriate paper(s):
## 
## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam.
## ggtree: an R package for visualization and annotation of phylogenetic
## trees with their covariates and other associated data. Methods in
## Ecology and Evolution. 2017, 8(1):28-36. doi:10.1111/2041-210X.12628
## 
## Shuangbin Xu, Lin Li, Xiao Luo, Meijun Chen, Wenli Tang, Li Zhan, Zehan
## Dai, Tommy T. Lam, Yi Guan, Guangchuang Yu. Ggtree: A serialized data
## object for visualization of a phylogenetic tree and annotation data.
## iMeta 2022, 1(4):e56. doi:10.1002/imt2.56
## 
## G Yu. Data Integration, Manipulation and Visualization of Phylogenetic
## Trees (1st ed.). Chapman and Hall/CRC. 2022. ISBN: 9781032233574
## 
## Attaching package: 'ggtree'
## The following object is masked from 'package:Biostrings':
## 
##     collapse
## The following object is masked from 'package:IRanges':
## 
##     collapse
## The following object is masked from 'package:S4Vectors':
## 
##     expand
## The following object is masked from 'package:ape':
## 
##     rotate
##   ggplot2  reshape2   stringr  phangorn        DT       ape   ggrepel     knitr 
##      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE 
##     Hmisc     dada2  phyloseq ShortRead  DECIPHER    ggtree 
##      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE
theme_set(theme_bw())

5 Raw fastq preprocessing

The initial part of the DADA2 workflow consists of pre-processing of the raw fastq files. Data quality is assessed, and several pipeline parameters can be tweaked to optimize the workflow.

DADA2 paired-end read handling is somewhat quirky, the forward and reverse reads are processed independently and only at the end of the workflow are they merged into a single representative sequence.

5.1 Plot Read quality

First have a look at the size and quality of the data.

FQF <- "./Raw_data/L3_MOCK1.F.fastq.gz"
FQR <- "./Raw_data/L3_MOCK1.R.fastq.gz"
plotQualityProfile(c(FQF,FQR))

The sample contains a total of 74,333 paired-end reads. Additionally, the quality of the reverse reads is lower than that of the forward reads.

5.2 Expected error rate

A more intuitive way to assess read quality is by examining the expected error rate. The following code reads the Phred quality scores from the FASTQ files and calculates the overall expected error for each read. To simplify, we’ll sample 1,000 reads and plot the cumulative error rate along the length of each read. The quality score is a log-transformed measure representing the likelihood that a given base call is incorrect as reported by the Illumina basecalling software.

# get the expected error rates and generate the data.frame
cumelative_error_F <- apply(10^(-as.matrix(PhredQuality(quality(readFastq(FQF)[1:1000])))/10),1,cumsum)
cumelative_error_R <- apply(10^(-as.matrix(PhredQuality(quality(readFastq(FQR)[1:1000])))/10),1,cumsum)

# convert the data into long format (required for ggplot2)
cumelative_error_F_long <- data.frame(reshape2::melt(cumelative_error_F), read="Forward")
cumelative_error_R_long <- data.frame(reshape2::melt(cumelative_error_R), read="Reverse")
df.long <- rbind(cumelative_error_F_long, cumelative_error_R_long)
#df.long$Var1 <- factor(ceiling(df.long$Var1/10)*10) # bin the cycles into ba

ggplot(df.long, aes(x=Var1, y=value)) + 
  labs(y="Cumulative expected error rate", x="Cycle number") + 
  facet_wrap(~read) + 
  geom_smooth(stat = 'summary', color = 'red', fill = 'red', alpha = 0.2, 
                fun.data = median_hilow, fun.args = list(conf.int = 0.9)) + 
  geom_smooth(stat = 'summary', color = 'blue', fill = 'blue', alpha = 0.2, 
                fun.data = median_hilow, fun.args = list(conf.int = 0.50)) +
  #scale_y_log10() + 
  geom_hline(yintercept = 1.35)

ee <- 1.35
sum(c(rowMedians(cumelative_error_F)<ee),rowMedians(cumelative_error_R)<ee)
## [1] 454
sum(rowMedians(cumelative_error_F)<ee)
## [1] 238
sum(rowMedians(cumelative_error_R)<ee)
## [1] 216

This analysis reveals that very few sequences are error-free, with the error rate rising almost exponentially across the read length. Based on data quality, we can anticipate an average of 2 errors per forward read and 3 per reverse read. This high error rate results in a significant number of technical sequence variants.

Historically, this issue was addressed by clustering sequences within an arbitrary similarity threshold, often set at 97%. For an amplicon of 430 base pairs, this would mean grouping sequences with up to 12 positional differences into a single representative cluster, known as an Operational Taxonomic Unit (OTU). However, over the past five years, we have increasingly adopted denoising algorithms, which enable us to distinguish between biologically relevant sequences and technical noise with greater accuracy.

5.3 Read duplication

Assuming that most errors are random, we can infer that these errors will lead to unique, singleton sequences, while sequences that are observed multiple times are more likely to represent true variants. DADA2 requires at least two observations of a sequence for it to be considered a true variant. To accomplish this, we use the derepFastq function from DADA2 to dereplicate both the forward and reverse reads.

drr <- derepFastq(FQF) # derep forward
drfut <- derepFastq(FQR) # derep reverse

Let’s examine the statistics for the forward read. The drr object contains the results of the dereplication process for the forward read. Within this derep class object, there is a map that indicates which sequences are grouped together (drr$map).

drr # check object
## derep-class: R object describing dereplicated sequencing reads
## $uniques: 74333 reads in 59118 unique sequences
##   Sequence lengths: min=251, median=251, max=251
## $quals: Quality matrix dimension:  59118 251
##   Consensus quality scores: min=7, median=35, max=38
## $map: Map from reads to unique sequences:  53754 5696 9443 25284 17014 ...
length(drr$map) # number of unique sequences
## [1] 74333
table(table(drr$map)) # distribution of abundances of these sequences
## 
##     1     2     3     4     5     6     7     8     9    10    11    12    13 
## 56871  1200   387   184   121    58    41    25    31    19    16    12     8 
##    14    15    16    17    18    19    20    21    22    23    24    25    26 
##    14     9    11     5     3     3     5     4     9     2     1     2     2 
##    27    28    29    30    31    32    33    34    35    36    37    38    39 
##     2     1     1     2     3     1     1     3     1     2     2     3     1 
##    40    41    42    47    48    51    56    57    61    68    70    71    73 
##     2     2     2     1     3     1     1     1     1     1     1     1     2 
##    74    76    79    81    83    89    91    92    98   101   103   105   107 
##     1     1     1     1     1     1     1     1     1     1     1     1     1 
##   109   113   116   120   150   152   156   169   209   219   268   305   370 
##     1     2     1     1     2     1     1     1     1     1     1     1     1 
##   387   459   541   975  1533 
##     1     1     1     1     1

As shown, out of the 74,333 forward reads, a total of 56,871 (76%) are unique and occur only once. This presents a challenge for amplicon inference using DADA2, as ASVs can only be reliably inferred from sequences that are observed more than once.

What about the reverse read

5.4 Read trimming

Fortunately, most of the errors are concentrated at the ends of the reads, allowing us to trim the lower-quality regions. By removing these poor-quality bases, we can reduce the number of unique sequences, which in turn increases the likelihood of detecting more true variants.

The V3V4 amplicons range in length from 400 to 430 base pairs, while the total sequencing read length is 2x251 bases. This results in a 70-100 base pair overlap between the forward and reverse reads. To successfully merge the forward and reverse reads downstream, we require a minimum 20 base pair overlap. Therefore, the combined read length (forward + reverse) must be at least 450 base pairs.

What would be appropriate read trim length? Hint: check the cumulative expected error rate plot above

We can trim and filter our reads using the dada2 filterAndTrim function.

length_trimmed_forward=240 # not optimal, check the error profiles for optimal trimming
length_trimmed_reverse=240 # not optimal, check the error profiles for optimal trimming

In contrast to the DADA2 tutorial, we do not filter based on expected errors. [Research]((https://pubmed.ncbi.nlm.nih.gov/31945086/) has shown that sequence quality can vary between different amplicons, and applying quality filtering in such cases may introduce bias, ultimately skewing the final taxonomic composition.

# filtered output files
FQFF <- "FQFF.gz"
FQRF <- "FQRF.gz"

out <- filterAndTrim(FQF, FQFF, FQR, FQRF,
                     truncLen=c(length_trimmed_forward,length_trimmed_reverse),
                     maxN=0, truncQ=0, rm.phix=F, compress=TRUE)
out
##                     reads.in reads.out
## L3_MOCK1.F.fastq.gz    74333     74333

NOTE: Individual statistics will change when different filtering criteria are used! Differences will indicate if you made a better choise for trimming

Check the duplication rate after trimming.

drf <- derepFastq(FQFF)
drf
## derep-class: R object describing dereplicated sequencing reads
## $uniques: 74333 reads in 55655 unique sequences
##   Sequence lengths: min=240, median=240, max=240
## $quals: Quality matrix dimension:  55655 240
##   Consensus quality scores: min=7, median=35, max=38
## $map: Map from reads to unique sequences:  50532 5878 9482 24071 16532 ...
sum(table(drf$map)==1)
## [1] 53128
drr <- derepFastq(FQRF)
drr
## derep-class: R object describing dereplicated sequencing reads
## $uniques: 74333 reads in 65323 unique sequences
##   Sequence lengths: min=240, median=240, max=240
## $quals: Quality matrix dimension:  65323 240
##   Consensus quality scores: min=7, median=32, max=38
## $map: Map from reads to unique sequences:  36597 234 65199 23704 29362 ...
sum(table(drr$map)==1)
## [1] 63749

What happened tot the expected error rate and read duplication?

5.5 Calculation of the batch specific error

The second step in the process is the estimation of the batch specific error rates. Use of the quality score to test the validity of a sequence is what sets dada2 apart from the other popular ASV inference methods like Deblur and UNOISE3.

Error estimation is somewhat computational demanding and therefore pre-calculated error rates for this particular sequence run are supplied. (Otherwise error rates can be obtained using ?learnErrors). The following plots can be used to asses the error models. They should show linear decrease in substitution rates over increasing quality scores.

errF <- readRDS("./Raw_data/L3_dada2.errF.RDS")
plotErrors(errF, nominalQ=TRUE)
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.

errR <- readRDS("./Raw_data/L3_dada2.errR.RDS")
plotErrors(errR, nominalQ=TRUE)
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.

Points show the quality score dependent substitution rates while the red line shows the expected rates.

What do you observe regarding the expected vs the real error frequency?

6 ASV Inference

The third step in the dada2 workflow is the actual inference of the ASVs using the dereplicated samples and the corresponding error models. For this we call the dada function.

asv.f <- dada(derep = drf, err = errF)
## Sample 1 - 74333 reads in 55655 unique sequences.
asv.f
## dada-class: object describing DADA2 denoising results
## 217 sequence variants were inferred from 55655 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
asv.r <- dada(derep = drr, err = errR)
## Sample 1 - 74333 reads in 65323 unique sequences.
asv.r
## dada-class: object describing DADA2 denoising results
## 70 sequence variants were inferred from 65323 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
table(is.na(asv.f$map[drf$map]),is.na(asv.r$map[drr$map]))
##        
##         FALSE  TRUE
##   FALSE 65507  4889
##   TRUE   2766  1171
sum(is.na(asv.f$map[drf$map]) | is.na(asv.r$map[drr$map]))
## [1] 8826

If you didnt change the trimming parameters DADA2 will have inferred 217 amplicon sequences from the forward reads and 70 from the reverse reads. Furthermore DADA2 did not find an suitable ASV representative for 18544 (25%) of the reads.

NOTE: asv.f\(map shows which dereplicated sequences map to a particular ASV, and asv.f\)map[drf$map] would give the mapping for each of the oringal sequences in the fastq (not dereplicated)

How do the forward and reverse read differ in performance? How could trimming differently improve these statistics?

6.1 read to ASV mapping

Traditional OTU clustering methods rely on fixed sequence distances.

Now, let’s examine the read-to-ASV mapping. We will select an arbitrary forward ASV and trace which sequences map to it. Both the derep and DADA2 class objects contain mapping files. First, let’s retrieve the reads from the FASTQ file.

x=24 # arbitrairy ASV ID

asv.f$denoised[x] # ASV Sequence
## TGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGAAAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGATGGGCAAGTCTGATGTGAAAACCCGGGGCTCAA 
##                                                                                                                                                                                                                                              485
F_ASV_4.FQ <- Biostrings::readQualityScaledDNAStringSet(FQFF)[drf$map %in% which(asv.f$map %in% x)] # reads mapping to forward ASV x
## Warning in XStringSet("DNA", x, start = start, end = end, width = width, :
## metadata columns on input DNAStringSet object were dropped

In total 485 reads map to this ASV.

Let check the divergence of the reads with the actual ASV. When we align the reads with the ASV we can calculate the hamming distance (# positions different). The hamming distance should represent the true sequencing error for each read.

read_asv_ham_dist <- nwhamming(as.character(F_ASV_4.FQ), asv.f$sequence[x])
summary(read_asv_ham_dist)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   5.064   7.000  34.000
barplot(table(read_asv_ham_dist)); abline(v = 240*0.03, col="red") #240*0.03 = 7.2 ~ 97% similarity @ 240 basepairs

sum(read_asv_ham_dist>7.2)
## [1] 119

As we can see most of the reads have at least one error and roughly 119 (20%) of the reads have more than 3% error rate.

If we take this particular subset of sequences and would run a OTU clustering based approach, what would the results look like?

Lets assume this was the sole bacterium in our sample and the selected reads was our limited dataset. What would an OTU clustering approach tell us?

First, we would get a distance matrix for all sequences. Here we will align all vs all reads and calculate the hamming distance (nwhamming). The code is very inefficient and just for show casing the sequence variance within the ASV. However with ever increasing datasets sizes, this was becoming a limiting factor clustering based approaches.

seq_abun <- data.frame(abundance=table(as.character(F_ASV_4.FQ)))
seq_abun$hamming_distance <- nwhamming(as.character(seq_abun$abundance.Var1), asv.f$sequence[x])
asv.dist.mat <- lapply(as.character(seq_abun$abundance.Var1), function(x) nwhamming(as.character(seq_abun$abundance.Var1), x))

We can visualize a clustering based approach by generating a distance tree for these particular reads. Cluster the sequences in a hierarchical tree and color the tips according to the OTU cluster.

simmat <- do.call(rbind, asv.dist.mat)/430
tree <- hclust(as.dist(simmat))
OTU <- cutree(tree = tree, h = 0.03) # ~97% similarity
aggregate(seq_abun$abundance.Freq, by=list(OTU), FUN=sum) # count the number or reads per "OTU"
##    Group.1   x
## 1        1   1
## 2        2   1
## 3        3   1
## 4        4   1
## 5        5   1
## 6        6   1
## 7        7 285
## 8        8   1
## 9        9   1
## 10      10   1
## 11      11   8
## 12      12   3
## 13      13   1
## 14      14   1
## 15      15   4
## 16      16  17
## 17      17   3
## 18      18   2
## 19      19   9
## 20      20   2
## 21      21   5
## 22      22   1
## 23      23   5
## 24      24   1
## 25      25   1
## 26      26   1
## 27      27   1
## 28      28   1
## 29      29   4
## 30      30  15
## 31      31   1
## 32      32   3
## 33      33   1
## 34      34   1
## 35      35   1
## 36      36   1
## 37      37   1
## 38      38   1
## 39      39   7
## 40      40   1
## 41      41   1
## 42      42   1
## 43      43   1
## 44      44   1
## 45      45  10
## 46      46   1
## 47      47   1
## 48      48   3
## 49      49   1
## 50      50   5
## 51      51   3
## 52      52  11
## 53      53   1
## 54      54   1
## 55      55   1
## 56      56   2
## 57      57   1
## 58      58   1
## 59      59   1
## 60      60   1
## 61      61   2
## 62      62   2
## 63      63   1
## 64      64   1
## 65      65   1
## 66      66   1
## 67      67   2
## 68      68   1
## 69      69   1
## 70      70   1
## 71      71   1
## 72      72   1
## 73      73   1
## 74      74   2
## 75      75   1
## 76      76   2
## 77      77   1
## 78      78   1
## 79      79   1
## 80      80   1
## 81      81   2
## 82      82   1
## 83      83   1
## 84      84   1
## 85      85   1
## 86      86   1
## 87      87   1
## 88      88   1
## 89      89   1
## 90      90   1
## 91      91   1
## 92      92   1
## 93      93   1
## 94      94   1
sort(aggregate(seq_abun$abundance.Freq, by=list(OTU), FUN=sum)[,2]) # sort by OTU abundance
##  [1]   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
## [20]   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
## [39]   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
## [58]   1   1   1   1   1   1   1   1   1   1   2   2   2   2   2   2   2   2   2
## [77]   3   3   3   3   3   4   4   5   5   5   7   8   9  10  11  15  17 285
plot.phylo(as.phylo(tree), direction = "downwards", show.tip.label = T, tip.color = rainbow(length(unique(OTU)))[OTU])

As you can see 94 unique OTUs would be formed if we would cluster the reads at 97% identity.

What would this scattering of the sequences have for implications in downstream analysis of the data For example in terms of alpha diversity

The error model already showed that the actual sequence error rate was probably higher than we could anticipate from the quality profiles. We can now also explore the expected error rate versus the true error rate based on the reads selected.

plot(apply(10^(-as.matrix(PhredQuality(F_ASV_4.FQ@quality))/10),1,sum), read_asv_ham_dist, 
     xlab="Expected Error", ylab="True Errors")

What do you observe in terms of expected error versus true error rate?

6.2 ASV pair merging

To get a single representative ASV sequence, forward and reverse ASV pairs are merged. Only those pairs that can be joined with at least a 20 bases overlap and without any mismatches are accepted. For this we use the mergePairs function.

asv.m <- mergePairs(dadaF = asv.f, derepF = drf, dadaR = asv.r, derepR = drr, returnRejects = T, maxMismatch = 0)
## Duplicate sequences in merged output.
DT::datatable(asv.m)

This table shows the statistics for all the asv pairs. Forward and reverse read mismatching is a common problem of Illumina sequencing. Especially amplicons that have a conserved region in the middle of the sequence appear to be prone for this issue, which is the case for V3V4 amplicons. Therefore significant amount of reads are lost in this process.

6.3 Data - ASV mapping

How many ASVs do we have and how many reads do they represent?

length(asv.m$accept) # number of possible ASV pair combinations
## [1] 3592
sum(asv.m$accept) # number of ASV accepted
## [1] 372
sum(asv.m[asv.m$accept,]$abundance) # number of reads in accepted ASV
## [1] 52071
sum(asv.m[!asv.m$accept,]$abundance) # number of reads in not accepted ASVs
## [1] 13436

Out of 3592 possible combinations 372 valid merged ASVs are generated. A total of 13436 (15%) reads belong to read pairs which do not properly merge. In total 30 % of the reads have been lost during processing of the data.

6.4 Chimera removal

During PCR amplification and bridge amplification in library preparation, potential chimeric sequences can be generated. These hybrid sequences introduce artificial connections in the phylogenetic relationships of the ASVs. As a result, ASVs are typically screened for chimeras and removed from the dataset. V3V4 amplicons are particularly susceptible to chimera formation due to their conserved region in the middle.

Fig 2. Chimeric sequences
Fig 2. Chimeric sequences

DADA2 has a function isBimeraDenovo to detect these chimeric sequences.

seqtab <- makeSequenceTable(asv.m[asv.m$accept,])
bimeras <- isBimeraDenovo(seqtab)
asv.m$bimera <- bimeras[asv.m$sequence]
sum(bimeras)
## [1] 289
sum(na.omit(asv.m$abundance[asv.m$bimera]))
## [1] 8055

A total of 289 ASVs were identified as possible chimeric sequences, representing 8055 reads (15%). When we remove these we will have completed the ASV inference and we will get our final ASV abundances.

seqtab.nochim <- seqtab[,names(which(!bimeras))]
length(seqtab.nochim)
## [1] 83
sum(seqtab.nochim)
## [1] 44016

In the end a total of 83 valid ASVs were inferred representing 44016 reads (57%). This means we lost a total 43% of the reads while processing the data. These are quite common statistics.

How could the algoritm statistics be improved? How could the loss of reads impact interpretation of the data? Note: each individual read was considered an observation

7 Precision and recall

To determine the accuracy and precision of our obtained profile, we compare the ASVs to to the sequences of the 55 reference strains. The reference sequences for each member have been obtained by sanger sequencing. Sanger sequencing does not allow to discriminate between isoforms and therefore the reference sequences contain some ambiguous basecalls. Thus we can expect some discrepancies between the ASV and the reference sequences which are due to errors in the reference rather than to errors in the ASVs.

7.1 Get the mock data

The reference sequences contain the entire 16S gene. In order to compare it to the data we can perform an in silico pcr to determine the expected ASV.

asv.valid.seqs <- names(seqtab.nochim)
# read in the sequences of the mock
mockref <- as.character(readDNAStringSet("./Raw_data/Mock_reference_NoN.fasta"))

insilicopcr <- list()
for(ref in names(mockref)){
  refseq <- mockref[ref]
  start <- vmatchPattern(pattern = "CCTACGGGAGGCAGCAG", subject = refseq, #forward V3 primer seq
                         max.mismatch=1, min.mismatch=0,
                         with.indels=FALSE, fixed=TRUE,
                         algorithm="auto")

  end <- vmatchPattern(pattern = "GGATTAGATACCCTTGTAGTC", subject = refseq, # reverse V4 primer seq
                       max.mismatch=5, min.mismatch=0,
                       with.indels=FALSE, fixed=TRUE,
                       algorithm="auto")

  start.i <- endIndex(start)[[1]]+1
  end.i <- startIndex(end)[[1]]-1

  if (!(isEmpty(start.i) | isEmpty(end.i))){insilicopcr[[ref]] <- Biostrings::subseq(refseq, start = start.i, end = end.i)}
}

mock.comp <- data.frame(read.csv("./Raw_data/Mock.composition.txt", sep=",", header = T, row.names = 1)) # get expected abundances
mock.comp$ASV <- unname(unlist(insilicopcr)) #include expected ASV sequence

Lets check the exact overlap between the expected amplicon sequences and the obtained ASV sequences

mock.comp$ASV %in% names(seqtab.nochim)
##  [1]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [37] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## [49]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
#DT::datatable(mock.comp)

As we can see many of the MOCK in silico amplicons have identical sequences to the obtained ASVs. Lets see how well they actually match by calculating the pairwise hamming distance between the sample and mock members.

hamming.list <- lapply(asv.valid.seqs, function(x) nwhamming(x, mock.comp$ASV))
hamming.distance.mat <- do.call(rbind, hamming.list)
colnames(hamming.distance.mat) <- rownames(mock.comp)
rownames(hamming.distance.mat) <- asv.valid.seqs
#hamming.distance.mat.long <- setNames(melt(hamming.distance.mat), c('asv', 'mock', 'dist'))

Get the closest relative for each mock and asv

mockrefdist <- data.frame(dist = colMins(hamming.distance.mat),
                          name = names(mockref))

ggplot(mockrefdist, aes(x=name, y=dist)) + 
  geom_bar(stat="identity") +
  coord_flip() + 
  labs(y="Minimum hamming distance", title="Distance mock to representive asv (Recall)", x=NULL)

54 of the mock members have a perfect or good representative ASV in the sample. Only a single species, MC_2 Micrococcus MC_2 does not have a good matching ASV. Recall is calculated as TP/(TP+FN), which in this case is 54/55=98%. So we can say that we have a recall of 98% in detection.

7.2 Precision

Besides recall another import measure is precision. In this case how many of the ASVs are actually derived from the mock. Lets assume that all ASVS with a higher than five hamming distance to the mock members are false positives.

Note, our reference sequences are imperfect so we need allow for some mismatching.

asvrefdist <- data.frame(dist = rowMin(hamming.distance.mat),
                         name = paste0("ASV_",stringr::str_pad(as.character(1:length(asv.valid.seqs)), width=3, pad = "0")))

ggplot(asvrefdist, aes(x=name, y=dist)) + 
  geom_bar(stat="identity") +
  coord_flip() +
  ggplot2::geom_hline(yintercept = 5, col='red') + 
  labs(y="Minimum hamming distance", title="Distance asv to mock representive (Precision)", x=NULL)

# False positive ASVs
sum(asvrefdist$dist<=5)
## [1] 66
# False positive ASVs total reads
#sum(seqtab.nochim[asvrefdist$dist<=5])

As we can see there were quiet some ASVs which are not likely to be derived from the mock members. In binary terms, ie presence or absence, we can say we have a precision of 66/80 which is 83%. Most of these are actually stealthy chimeras.

What would the precision be if we consider the reads rather than the ASVs? Hint sum(seqtab.nochim[asvrefdist$dist<=5])

8 ASV Taxonomy Annotation

Since ASV sequences by themselves are none informative we need to put them into context. We use phylogenetic trees to put the sequences in evolutionary context, while we use taxonomy in order to put them in scientific context.

8.1 Assigning taxonomy

Taxonomy is how we classify organisms into specific groups at certain ranks. Taxonomy tries to capture phylogeny (evolutionary relationships) at a coarse level.

Fig 3. Main taxonomic ranks

Fig 3. Main taxonomic ranks

Lets assign taxonomy to the inferred ASV sequences. DADA2 implements a variant of the (RDP) naive bayes classifier. In short kmer profiles from the ASV sequences are compared to those in a curated taxonomy reference database. In this case the reference database is Silva V132.

tax <- assignTaxonomy(seqs = asv.valid.seqs,"./Raw_data/silva_nr_v132_train_set.fa.gz")
# lets inlcude the Genus taxonomy in the name of the ASV
asvrefdist$name_genus <- paste(asvrefdist$name,tax[asv.valid.seqs,"Genus"])

8.2 Phylogenetic tree

A phylogenetic tree tries to capture the evolutionary relationships of various organisms. We can use the sequence (dis)similarity of the ASVs to infer these phylogenetic relationships. These trees help in generating context for our inferred sequences.

seqs <- c(asv.valid.seqs,mock.comp$ASV)
names(seqs) <- c(asvrefdist$name_genus,mockrefdist$name)
alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA)
## Determining distance matrix based on shared 8-mers:
## ================================================================================
## 
## Time difference of 0.11 secs
## 
## Clustering into groups by similarity:
## ================================================================================
## 
## Time difference of 0.01 secs
## 
## Aligning Sequences:
## ================================================================================
## 
## Time difference of 0.93 secs
## 
## Iteration 1 of 2:
## 
## Determining distance matrix based on alignment:
## ================================================================================
## 
## Time difference of 0.01 secs
## 
## Reclustering into groups by similarity:
## ================================================================================
## 
## Time difference of 0.01 secs
## 
## Realigning Sequences:
## ================================================================================
## 
## Time difference of 0.35 secs
## 
## Iteration 2 of 2:
## 
## Determining distance matrix based on alignment:
## ================================================================================
## 
## Time difference of 0.01 secs
## 
## Reclustering into groups by similarity:
## ================================================================================
## 
## Time difference of 0.01 secs
## 
## Realigning Sequences:
## ================================================================================
## 
## Time difference of 0.13 secs
phang.align <- phyDat(as(alignment, "matrix"), type="DNA")
dm <- dist.ml(phang.align)
treeNJ <- NJ(dm)

fit = pml(treeNJ, data=phang.align)
fitGTR <- update(fit, k=4, inv=0.2)
#fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE,
#                      rearrangement = "stochastic", control = pml.control(trace = 0))
fitGTR$tree <- midpoint(fitGTR$tree)

Lets visualize the tree; sequences in out mock are colored red, ASVs which correspond to a true member of the mock are colored green, while ASVs which are unlikely derived from the mock community (Xeno) are colored blue.

groupInfo <- rep("Silva",length(fitGTR$tree$tip.label))
groupInfo[fitGTR$tree$tip.label %in% asvrefdist[asvrefdist$dist>=5,]$name_genus] <- "Xeno_ASV"
groupInfo[fitGTR$tree$tip.label %in% asvrefdist[asvrefdist$dist<5,]$name_genus] <- "Valid_ASV"
groupInfo[substr(fitGTR$tree$tip.label,1,3)=="MC_"] <- "MOCK"

groupInfo <- split(fitGTR$tree$tip.label, groupInfo)

fitGTR$tree <- groupOTU(fitGTR$tree, groupInfo)

test <- fitGTR$tree
attr(test, "Source") <- attributes(fitGTR$tree)$group


p <- ggtree(test, layout = "rectangular")+ geom_tiplab(size=3, aes(color=Source),key_glyph = "rect", ) + ggplot2::xlim(0, 0.3)
## Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
## Also defined by 'tidytree'
p

We can see that most of the ASVs and mock sequences occur in pairs and their associated taxonomies are mostly congruent. The sequences that do not get classified at Genus level (NA) are all labeled xeno, and are likely stealthy chimeras. However there apear some valid sequences that are not originating from the mock.

What could these sequences represent?

8.3 Abundance correlation

Various steps in generating and processing of the data results in bias in the observed final composition (abundance). Since taxonomy is too coarse grained and potentially biased, lets cluster the ASVs based on the distance in the tree.

dd = as.dist(cophenetic.phylo(fitGTR$tree))
psclust = cutree(hclust(dd), h = 0.05)

I’ve group the tree tips (ASVs, and insilico PCR amplicons) at a semi arbitrary distance of 0.05. How would the results change if I would use a much higher or lower cutoff?

Lets aggregate the ASV abundance and mock composition, based on the clustering of the tree. Once we’ve grouped the ASV and mock sequences we can compare the expected versus the observed relative abundances.

seqtab.nochim.relabu <- seqtab.nochim/sum(seqtab.nochim)

Abundances <- c(seqtab.nochim.relabu, mock.comp$MC3/100)
names(Abundances) <- names(seqs)

cluster_list <- split(names(psclust), psclust)
names(cluster_list) <- unlist(lapply(cluster_list, function(x) names(which.max(Abundances[x])))) # representative sequence

cl.abundance_sample <- aggregate(seqtab.nochim.relabu, by=list(psclust[c(asvrefdist$name_genus)]), FUN=sum)
cl.abundance_mock <- aggregate(mock.comp$MC3/100, by=list(psclust[c(mockrefdist$name)]), FUN=sum)

rownames(cl.abundance_sample) <- names(cluster_list)[cl.abundance_sample[,1]]
rownames(cl.abundance_mock) <- names(cluster_list)[cl.abundance_mock[,1]]

df <- data.frame(label=names(cluster_list),
                 sample=cl.abundance_sample[names(cluster_list),2],
                 mock=cl.abundance_mock[names(cluster_list),2]
                 )
df[is.na(df)] <- 0

ggplot(df, aes(x=sample, y=mock)) + 
  geom_point(size=3) + 
  ggrepel::geom_text_repel(aes(label=label)) + 
  geom_abline(slope = 1) + 
  labs(x="Measured Abundance", y="Predicted Abundance") + 
  #scale_x_log10() + 
  #scale_y_log10() +
  NULL
## Warning: ggrepel: 27 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

While we can observe some bias for specific groups like Akkermansia and Roseburia overall there is significant correlation between the observed and expected abundance.

How does the observed species specific bias impact the data? For example; We observe a lot more Akkermansia reads than we measured. Note, we measured a fixed set of reads

9 Summary

In this workshop we went through a basic workflow for enriching 16S amplicon data in order to get an annotated microbiome count table.

Take away messages: